This midterm project aims to build a more accurate and generalizable model for predicting house prices in San Francisco, California, for Zillow. House prices are a constant concern not only for property owners and renters but also for governments, which rely on them to deliver public services such as tax assessment. An accurate and generalizable price prediction model gives buyers and renters more reliable information, and it helps the government serve residents more equitably.
Building a good model is challenging, however. First, no one really knows the true value of a property. People decide to pay a certain amount for a house for many reasons, which are often hard to name, let alone quantify. Second, the variables collected for the model depend to some degree on the modelers’ judgment, and it is difficult to dig the most informative variables out of the piles of data available online. Third, linear regression may not be an effective way to predict house prices, given the complicated relationship between prices and the independent variables. Last but not least, San Francisco is a global city with great diversity inside each neighborhood, which makes a good prediction even harder.
The overall modeling strategy is to use a hedonic model that deconstructs house prices into a group of physical characteristics, such as the number of bedrooms and the property area, and a group of place-based characteristics, such as the average distance to the two nearest crimes and the distance to the airport or a park. The model also includes features that describe the spatial qualities of the area in which each house is located.
To summarize, the model produced in this project is effective and generalizes well. It performs well for houses in the middle of the price range; for very expensive houses, the error tends to be larger. In general, the model could be used as a reference for Zillow.
The internal characteristics of the houses, such as the number of bedrooms and the property area, come directly from the Zillow dataset. For the amenities and public services the houses are exposed to, most of the initial data comes from the San Francisco Open Data website. Some feature engineering was applied while transforming the data. For instance, to measure the school service available to each house, the distance to the nearest school is measured. Other data is likewise transformed into measures of exposure to crimes and homeless concerns, as well as distances to highways, parks, and so on. The spatial structure variables come partly from the ACS and partly from San Francisco Open Data.
This section describes all the variables that we collected. Some variables may be excluded from the model after inspecting the correlation matrix. SalePrice is the dependent variable. In the current observations, it ranges from $100,001 to $4,750,003.
I. Internal characteristics
For internal characteristics, the property area and lot area are used as continuous variables. Built year is transformed into a binary variable of 0 or 1 where 1 represents the properties built before 1938 or after 1965, 0 represents the properties built between 1938 and 1965.
The number of bedrooms, the number of bathrooms and the number of stories are transformed into categorical data. Bedroom categories are 1 Bed, 2 Beds, 3-5 Beds and 6+ Beds. Bathroom categories are 1 Bath, 2 Baths, 3-5 Baths and 6+ Baths. Story categories are 1 Floor, Up to 3 Floors, 4 Floors and 4+ Floors.
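The recoding described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the function names are hypothetical, a count of zero bedrooms is assumed to fall into the "1 Bed" category, and the 1938 and 1965 boundary years are assumed to be inclusive.

```python
def bedroom_category(n_bedrooms: int) -> str:
    """Map a bedroom count to the four categories listed in the text.

    Assumption: counts of 0 or 1 are both labeled "1 Bed".
    """
    if n_bedrooms <= 1:
        return "1 Bed"
    if n_bedrooms == 2:
        return "2 Beds"
    if n_bedrooms <= 5:
        return "3-5 Beds"
    return "6+ Beds"


def built_year_flag(year: int) -> int:
    """Return 0 for homes built between 1938 and 1965, otherwise 1.

    Assumption: both boundary years count as "between".
    """
    return 0 if 1938 <= year <= 1965 else 1
```

The bathroom categories follow the same pattern with different labels.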
II. Amenities & public services
For amenities and public services, there are continuous variables as below: the distance to the nearest homeless concern report, the average distance to the two nearest crimes, the average distance to the three nearest fire incidents, the distance to the nearest school, the distance to the nearest park, the distance to the nearest art building, the distance to the nearest hospital, the average distance to the three nearest retail stores, the average distance to the three nearest restaurants, the average distance to the five nearest bus stations, the distance to the nearest BART station, the average distance to the five nearest evictions, the distance to highway, the distance to arterial roads, the distance to historic districts.
There is one binary variable, the airport buffer, in which 1 represents properties within 10 miles of the airport and 0 represents properties beyond 10 miles. The 10-mile distance is based on commonly accepted knowledge about how noise and air pollution vary with distance from the airport.
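As a rough sketch, the airport buffer can be computed as below. The function name is hypothetical, and the coordinates are assumed to be in a projected coordinate system measured in feet, so that straight-line distance is meaningful; the report does not state the projection.

```python
import math

FEET_PER_MILE = 5280


def airport_buffer(house_xy, airport_xy, radius_miles=10.0):
    """Return 1 if the house lies within radius_miles of the airport, else 0.

    Assumption: both points are (x, y) pairs in projected feet.
    """
    distance_ft = math.dist(house_xy, airport_xy)
    return 1 if distance_ft <= radius_miles * FEET_PER_MILE else 0
```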
III. Spatial structure
The spatial structure variables contain the following continuous variables: the area of tree canopy, population density, percentage of residents with a bachelor’s degree, median income, percentage of household and percentage of vacancy. There are also two binary variables. In the slope variable (“great20”), 0 represents a slope of less than 20 degrees while 1 represents a slope greater than 20 degrees. In the race variable (“MajorityWhite”), 0 represents a white population share below 50% while 1 represents a share above 50%.
| Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
|---|---|---|---|---|---|---|---|
| SalePrice | 9,403 | 1,145,288.000 | 701,138.400 | 100,001 | 695,001.5 | 1,380,003 | 4,750,003 |
| Homeless_nn1 | 9,403 | 1,943.167 | 1,445.771 | 21.965 | 849.615 | 2,654.056 | 9,981.013 |
| Crime_nn2 | 9,403 | 295.042 | 115.153 | 62.671 | 208.126 | 353.615 | 1,009.099 |
| Fire_nn3 | 9,403 | 2,331.342 | 1,225.599 | 171.914 | 1,498.301 | 2,801.456 | 7,700.580 |
| School_nn1 | 9,403 | 907.436 | 473.959 | 24.227 | 545.918 | 1,201.668 | 3,320.439 |
| Park_nn1 | 9,403 | 1,292.545 | 618.394 | 43.515 | 818.260 | 1,710.708 | 4,891.932 |
| Airports_Buffer | 9,403 | 0.828 | 0.377 | 0 | 1 | 1 | 1 |
| ArtBuilding_nn1 | 9,403 | 2,484.204 | 1,248.125 | 39.529 | 1,566.045 | 3,229.651 | 7,121.239 |
| Hospital_nn1 | 9,403 | 6,935.752 | 4,102.276 | 122.779 | 3,375.781 | 10,517.740 | 15,971.210 |
| Retail_nn3 | 9,403 | 533.504 | 260.655 | 62.043 | 343.275 | 677.475 | 3,079.878 |
| Food_nn3 | 9,403 | 736.380 | 392.323 | 69.011 | 439.233 | 958.958 | 2,781.677 |
| Bus_nn10 | 9,403 | 508.458 | 226.660 | 26.269 | 338.397 | 641.557 | 1,597.109 |
| BART_nn1 | 9,403 | 8,674.266 | 5,769.163 | 268.375 | 4,083.046 | 12,113.660 | 26,623.030 |
| distance_highway1 | 9,403 | 7,134.042 | 4,938.319 | 40.866 | 3,120.113 | 10,157.950 | 21,661.440 |
| Evictions_nn5 | 9,403 | 565.146 | 290.328 | 149.573 | 376.773 | 661.515 | 2,944.874 |
| great20 | 9,403 | 0.217 | 0.412 | 0 | 0 | 0 | 1 |
| distance_Arterial | 9,403 | 710.261 | 533.907 | 35.469 | 278.583 | 1,016.486 | 2,970.281 |
| distance_Historic | 9,403 | 15,995.690 | 6,474.052 | 75.346 | 10,695.400 | 21,380.120 | 30,141.130 |
| LotArea | 9,403 | 279,746.500 | 102,306.200 | 18.000 | 237,400.000 | 300,000.000 | 1,890,500.000 |
| PropArea | 9,403 | 1,642.193 | 703.976 | 187 | 1,160 | 1,978 | 7,679 |
| BYear | 9,403 | 0.333 | 0.471 | 0 | 0 | 1 | 1 |
| Tree | 9,403 | 7,180.351 | 34,091.010 | 0.000 | 1,632.773 | 6,833.962 | 818,378.200 |
| Med_Income | 9,403 | 110,471.800 | 33,062.720 | 0 | 82,734 | 137,969 | 195,375 |
| Pct_bachelor | 9,403 | 0.545 | 0.202 | 0.000 | 0.382 | 0.711 | 0.892 |
| Pct_hhold | 9,403 | 0.602 | 0.153 | 0.000 | 0.484 | 0.722 | 0.863 |
| Pct_vacancy | 9,403 | 0.058 | 0.034 | 0.000 | 0.037 | 0.080 | 0.236 |
| Pop_den | 9,403 | 22,927.060 | 8,964.998 | 0.000 | 17,032.230 | 27,687.410 | 108,640.300 |
| MajorityWhite | 9,403 | 0.438 | 0.496 | 0 | 0 | 1 | 1 |
According to the correlation matrix, most variables are not strongly collinear. The distance to the nearest hospital (Hospital_nn1) has some correlation with the distance to historic districts. Since these two variables are not theoretically related, both are included in the model. The percentage of the population with a bachelor’s degree is correlated with the distance to the nearest hospital, the distance to historic districts, median income and the percentage of household, so it is excluded from the model.
The four factors of interest selected here are the average distance to the two nearest crimes, the average distance to the three nearest restaurants, median income and property area. According to the scatterplots, house prices rise as the distance to the two nearest crimes increases. As the distance to the three nearest restaurants increases, house prices stay stable with a slight tendency to decline. House prices rise with the median income of the census tract in which the house is located, and they also rise with property area.
The sale price map shows that the highest house prices cluster in the center of San Francisco, in neighborhoods such as Corona Heights, Dolores Heights and Sherwood Forest, and in the north of the city next to the Presidio and the water, in neighborhoods such as Presidio Heights and the Marina. The lowest house prices cluster in the south and west of San Francisco.
I. Average distance to 2 nearest crime locations
I.1 Interesting Finding: Type of Crime
The chart below shows the categories of crime incidents that occurred in San Francisco from 2012 to 2015. From these categories, we chose the following to include in our variable: Assault, Burglary, Robbery and Rape.
The map below shows the average distance from each house to its two nearest crime incidents. If we treat a shorter distance to crime as a sign of a less safe environment, the less safe neighborhoods cluster in the northeast and southeast of San Francisco, including Russian Hill, Telegraph Hill and Candlestick Point. The neighborhoods with a longer distance to crime, which can be considered safer, lie along the west coast and more generally in the west of the city. Comparing this map with the sale price map above, a longer distance to crime does not appear to raise house prices much, but a shorter distance to crime is likely to have a negative effect on them.
II. Median Household Income
Median income is shown here in six non-overlapping levels. According to the map, the center of San Francisco and parts of the north have relatively high median incomes. Compared with the sale price map, higher-income areas generally have higher house prices and lower-income areas generally have lower house prices.
III. Distance to nearest hospital
III.1 Interesting Finding: Type of health care facilities
The following chart shows all the health care facilities in San Francisco. Considering the importance they have for houses, only the general acute care hospitals are kept as the variable.
According to the map below, hospitals in San Francisco are spatially clustered: the center and the northeast are closer to the nearest hospital than the west and southeast. Compared with the sale price map, the distance-to-hospital map shows little correlation between house price and distance to hospital, since the areas close to a hospital include both high-price and low-price neighborhoods.
The general method used in this project is to build a hedonic model, which deconstructs house price into three parts (internal characteristics, amenities/public services and spatial structure) and then relates the house price mathematically to these three parts.
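One common way to write such a hedonic specification is the linear form below. The notation is our own shorthand for this illustration and is not taken from the project code: $I_{ij}$, $A_{ik}$ and $S_{im}$ stand for the internal, amenity/public service and spatial structure variables of house $i$, the Greek letters are coefficients to be estimated, and $\varepsilon_i$ is the error term.

```latex
\text{SalePrice}_i
  = \beta_0
  + \sum_{j} \beta_j \, I_{ij}
  + \sum_{k} \gamma_k \, A_{ik}
  + \sum_{m} \delta_m \, S_{im}
  + \varepsilon_i
```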
I. Data Wrangling
In this part, data is collected from available sources such as San Francisco Open Data, the ACS and the Berkeley Geo Library. The variables fall into three parts: internal characteristics, amenities/public services and spatial structure. Internal characteristics represent the houses’ own features, such as the number of bedrooms. Amenities/public services represent what the nearby environment exposes a house to, such as schools, parks or crimes. Spatial structure represents the quality of the area in which a house is located, such as the slope or the demographic conditions.
II. Feature Engineering
After the data is gathered, it is transformed into variables that can be related mathematically to house prices. For example, the crime data collected from San Francisco Open Data consists of points on a map. To relate them to the houses, the average distance from each house to its nearest crime incident points is measured. Another example is the airport buffer. A house within 10 miles of the airport gets a value of 1, meaning it is inside the buffer; otherwise it gets a value of 0. These variables all feed into the final hedonic model.
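The nearest-neighbor distance features can be sketched as follows. This is a minimal brute-force illustration rather than the project's implementation: the function name is hypothetical, and coordinates are assumed to be planar (projected), so Euclidean distance applies.

```python
import math


def mean_knn_distance(house_xy, points, k):
    """Average straight-line distance from a house to its k nearest points.

    Assumption: house_xy and every entry of points are planar (x, y) pairs.
    """
    if k < 1 or len(points) < k:
        raise ValueError("k must be at least 1 and no larger than the number of points")
    distances = sorted(math.dist(house_xy, p) for p in points)
    return sum(distances[:k]) / k
```

Calling it with `k=2` on the crime points would give a variable like Crime_nn2, and `k=1` on the school points a variable like School_nn1. For thousands of houses a spatial index would be faster, but the result is the same.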
III. Correlation and Multicollinearity
In this section, the relationship between each pair of variables is tested. If two variables are highly correlated, they carry very similar statistical information, so including both adds little. Some highly correlated variables are removed after consideration.
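A pairwise screening of this kind might look like the sketch below. The function names and the 0.8 cutoff are assumptions for illustration; the report does not state which threshold was used, and the final decision on each pair was made by judgment, not by a rule.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Note: a constant column has zero variance and would divide by zero.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.8):
    """List (name_a, name_b, r) for every pair with |r| >= threshold.

    columns is a dict mapping a variable name to its list of values.
    The default threshold is an illustrative assumption.
    """
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson(columns[name_a], columns[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged
```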
IV. Regression Model
Using the remaining variables, a linear regression model is built to relate house price to the three variable categories. The model has 25 variables across all the categories. The original model does not include a neighborhood effect. Since some errors are clustered in specific neighborhoods, a neighborhood effect is added to the model to improve its performance. The model explains about 63% of the variation in sale price, with a mean absolute error of approximately $293,574.7.
V. Cross Validation
To obtain an accurate and generalizable model, cross-validation divides the observations into two sets: training (60% of the data) and test (the remaining 40%). The idea is to fit the model on the training set and check whether every randomly chosen test set is explained well by the current variables. After 100 folds, around 64% of the variance in sale price is explained by the model.
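The splitting procedure can be illustrated with the sketch below. It is a simplified stand-in under loud assumptions: a one-predictor least-squares line replaces the full 25-variable model to keep the example short, each repeat is an independent random 60/40 split, and all function names are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def repeated_holdout_mae(x, y, n_repeats=100, train_share=0.6, seed=0):
    """Return the test-set MAE of each random train/test split."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    cut = int(len(idx) * train_share)
    maes = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        train, test = idx[:cut], idx[cut:]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        errors = [abs(y[i] - (intercept + slope * x[i])) for i in test]
        maes.append(sum(errors) / len(errors))
    return maes
```

The spread of the returned values is what a histogram of cross-validation MAE summarizes: a narrow spread suggests the model behaves similarly on data it has not seen.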
According to the summary of five different tests, the R^2 lies between 0.62 and 0.65, the MAE lies between $293,000 and $296,000, and the MAPE is roughly 29% to 30%.
| Test | R^2 | MAE ($) | MAPE |
|---|---|---|---|
| 1 | 0.6266 | 294814.6 | 28.82% |
| 2 | 0.6465 | 295499.5 | 30% |
| 3 | 0.6362 | 293152.8 | 29.47% |
| 4 | 0.6389 | 293089.3 | 29.64% |
| 5 | 0.6423 | 293975.2 | 30.51% |
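
For reference, the three metrics in the table can be computed as in the sketch below. The function names are our own, and R^2 is written here as one minus the ratio of residual to total sum of squares; the tool used in the project may instead report the squared correlation between observed and predicted values, which can differ slightly.

```python
def mae(actual, predicted):
    """Mean absolute error, in the units of the sale price."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a proportion (0.29 means 29%)."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of variance explained: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```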
I. Cross-validation Results
Across the 100 folds of cross-validation, the RMSE is $422,131.5. The large gap between the RMSE and the MAE indicates that the model makes some large prediction errors.
| RMSE ($) | R^2 | MAE ($) |
|---|---|---|
| 422131.5 | 0.6389699 | 293574.7 |
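
The reason a gap between RMSE and MAE points to a few large errors is that RMSE squares each error before averaging. The toy example below uses made-up numbers, not project data: both prediction sets have the same MAE, but the one whose error is concentrated in a single house has twice the RMSE. In the table above the ratio of RMSE to MAE is about 1.44.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; large errors weigh more than small ones."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical sale prices and two hypothetical sets of predictions.
actual = [1_000_000, 1_000_000, 1_000_000, 1_000_000]
even_errors = [900_000, 900_000, 900_000, 900_000]      # every house off by 100,000
one_big_error = [1_000_000, 1_000_000, 1_000_000, 600_000]  # one house off by 400,000

print(mae(actual, even_errors), rmse(actual, even_errors))      # 100000.0 100000.0
print(mae(actual, one_big_error), rmse(actual, one_big_error))  # 100000.0 200000.0
```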
II. Histogram of the cross-validation MAE
The histogram of the cross-validation MAE demonstrates that errors are mostly gathered around $260,000 to $320,000.
In the plot below, the black points are the observed sale prices in the dataset, and the pale green line represents a perfect prediction, where the predicted value equals the observed value. The blue line is the prediction the model actually provides. When the sale price is around $1,000,000, the model predicts with relatively small residuals. Below $1,000,000, the model tends to overpredict the house price; above $1,000,000, it tends to underpredict. Because some houses have very high prices, the model performs worse on higher-value houses.
As noted in the previous section, the largest residuals appear where house prices are very high, namely the center and the north of San Francisco. Where house prices are relatively low, the residuals are relatively small.
The Moran’s I computed from the test-set residuals of this model is close to zero, which indicates spatial randomness. The observed Moran’s I is shown in red; it is not higher than all of the 999 randomly generated permutations. The Moran’s I therefore suggests that the model’s features capture much of the spatial structure of house prices, although some spatial factors may still be missing.
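For readers unfamiliar with the statistic, the sketch below shows how Moran’s I and a permutation test can be computed from scratch. It is an illustration only: the function names are hypothetical, the spatial weights matrix is passed in as a plain list of lists, and the report does not state which weights (for example k nearest neighbors) the project used.

```python
import random


def morans_i(values, weights):
    """Moran's I for values under an n x n spatial weights matrix.

    weights[i][j] is the weight between observations i and j (zero diagonal).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weight_sum = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / weight_sum) * (cross / sum(d * d for d in dev))


def permutation_p_value(values, weights, n_perm=999, seed=0):
    """Observed Moran's I and a one-sided pseudo p-value.

    The values are shuffled across locations n_perm times; the p-value is
    the share of statistics (shuffled plus observed) at least as large as
    the observed one.
    """
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)
```

A value near zero with a large pseudo p-value is what spatially random residuals would produce; clearly positive values with a small p-value would mean that nearby errors resemble each other.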
The map below shows the house prices predicted by the model created in this project. The general spatial trends match the known house price trends. The highest prices appear in the center and the north of San Francisco, and the lower prices appear on the west and south sides of the city. If we look closely at the map, however, the price variance inside each neighborhood is very small. This is due to the limited spatial resolution of the data, much of which is at the neighborhood or census tract scale. The extreme case is Hunters Point, which has a MAPE of 2.24 and only about five sale price data points.
The map below shows the MAPE by neighborhood in San Francisco. The west of San Francisco generally has a lower MAPE than the east. Neighborhoods with high house prices tend to have a higher MAPE. Some neighborhoods in the south with high MAPEs have very few sale price observations, so their estimates are unreliable.
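A per-neighborhood MAPE of this kind can be tabulated as in the sketch below. The function name and record layout are assumptions for illustration. The observation count is returned next to each MAPE because, as noted above, a neighborhood with only a handful of sales gives an unreliable estimate.

```python
from collections import defaultdict


def mape_by_group(records):
    """Return {group: (mape, n)} from (group, actual, predicted) records.

    MAPE is expressed as a proportion, so 2.24 would mean 224%.
    """
    error_sums = defaultdict(float)
    counts = defaultdict(int)
    for group, actual, predicted in records:
        error_sums[group] += abs(actual - predicted) / actual
        counts[group] += 1
    return {g: (error_sums[g] / counts[g], counts[g]) for g in error_sums}
```

The same grouping works for any categorical context, such as majority white versus majority non-white tracts.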
According to the scatterplot below, the MAPE remains relatively stable as the mean sale price varies, with a slight tendency to decline as the mean sale price goes up. The scatterplot indicates that the model fits about equally well across neighborhoods with mean prices between $1,000,000 and $2,000,000, regardless of the price variation. In other words, the model generalizes well across neighborhoods with different price levels.
Using ACS racial data obtained through tidycensus, the race context is mapped below in two categories: majority white and majority non-white. As stated above, the center and north of San Francisco generally have higher house prices. In the race context, these areas are mostly majority white.
The table below shows the prediction error for majority white and majority non-white neighborhoods. The values are proportions of sale price, in line with the MAPE reported above. In this model, the error for majority non-white neighborhoods is about 0.02 lower than for majority white neighborhoods (0.284 versus 0.302). The model is thus slightly less accurate in majority white neighborhoods, but the small difference also supports its generalizability.
| Majority Non-White | Majority White |
|---|---|
| 0.2839602 | 0.3018766 |
In general, the model is effective: its R-squared of 0.64 means that it explains about 64% of the variation in housing sale prices. In addition, the model is generalizable and can be used to predict sale prices across neighborhoods that differ in race and income. The variables of particular interest here are the distance to the two nearest crime incidents, median income and the distance to the nearest hospital. The distance to the nearest hospital is a highly significant predictor, with a p-value below 0.001. The Moran’s I test indicates that the residuals show little spatial autocorrelation. In other words, the model accounts for most of the spatial variation in prices, although the MAPE map by neighborhood shows that the error still differs between parts of the city.
To check the generalizability of the model, we ran a cross-validation test; the standard deviation of the MAE is small enough to indicate that the model is generalizable. Overall, the model predicts well outside the areas where house prices are extremely high. There are several possible reasons for the poor performance in high-price areas. First, the data wrangling and feature engineering are relatively subjective. Second, the accessible data may not be representative enough for house price prediction. Lastly, a linear regression model may not be an effective option for houses with extremely high prices.
We would highly recommend this model to Zillow because it generalizes well. To keep the model usable over time, the data needs to be updated regularly. Moreover, to perform well in other cities, the model should take each city’s special characteristics into account and add data that represents them.